# Multimodal Video Understanding

## Cosmos Reason1 7B GGUF
- **Maintainer:** unsloth · **License:** Other · **Downloads:** 6,690 · **Likes:** 1
- **Tags:** Text-to-Video, Transformers, English

Cosmos-Reason1 is a Physical AI model developed by NVIDIA, capable of understanding physical common sense and generating embodied decisions in natural language through long-chain reasoning. This entry packages the 7B model in GGUF format for llama.cpp-compatible runtimes.

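Since this entry ships GGUF weights, a quantization can be pulled locally with `huggingface_hub`. A minimal sketch; the repo id and quant filename are assumptions, so check the unsloth model page for the files it actually ships:

```python
# Sketch: download a GGUF quantization of Cosmos-Reason1-7B for use with a
# llama.cpp-compatible runtime. Both the repo id and the filename below are
# assumptions; check the unsloth model page for the quantizations it ships.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="unsloth/Cosmos-Reason1-7B-GGUF",   # assumed repo id
    filename="Cosmos-Reason1-7B-Q4_K_M.gguf",   # assumed quant filename
)
print(path)  # local cache path, ready to hand to llama.cpp / llama-cpp-python
```
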
## Cosmos Reason1 7B
- **Maintainer:** nvidia · **License:** Other · **Downloads:** 18.56k · **Likes:** 72
- **Tags:** Transformers, English

Cosmos-Reason1 is a Physical AI model developed by NVIDIA, capable of understanding physical common sense and generating embodied decisions through long-chain reasoning.

## Anon
- **Maintainer:** aiden200 · **License:** Apache-2.0 · **Downloads:** 361 · **Likes:** 0
- **Tags:** English

A fine-tuned version of the lmms-lab/llava-onevision-qwen2-7b-ov model, supporting video-text-to-text tasks.

## Internvideo2 Stage2 6B
- **Maintainer:** OpenGVLab · **License:** MIT · **Downloads:** 542 · **Likes:** 0
- **Tags:** Video-to-Text, Safetensors

InternVideo2 is a multimodal video understanding model with 6B parameters, focusing on video content analysis and comprehension tasks.

## Qwen2.5 VL 72B Instruct Pointer AWQ
- **Maintainer:** PointerHQ · **License:** Other · **Downloads:** 5,592 · **Likes:** 8
- **Tags:** Image-to-Text, Transformers, English

Qwen2.5-VL is the latest vision-language model in the Qwen family, featuring enhanced visual understanding, agent capabilities, and structured output generation. This entry is an AWQ-quantized build of the 72B Instruct model.

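As a sketch of how an AWQ build like this is typically consumed: Qwen2.5-VL checkpoints load through the dedicated classes that transformers 4.49 added, with the `qwen-vl-utils` helper preparing vision inputs and AutoAWQ supplying the quantized kernels. The repo id and image path below are assumptions:

```python
# Sketch: load an AWQ-quantized Qwen2.5-VL with transformers (>=4.49) and run
# single-image inference. Repo id and image path are assumptions, not verified.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

repo = "PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ"  # assumed repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    repo, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(repo)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "file:///path/to/frame.jpg"},  # placeholder
    {"type": "text", "text": "Describe this image in one sentence."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[prompt], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens before decoding, keeping only the generated answer.
answer = processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                                skip_special_tokens=True)[0]
print(answer)
```
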
## VL3 SigLIP NaViT
- **Maintainer:** DAMO-NLP-SG · **License:** Apache-2.0 · **Downloads:** 25.55k · **Likes:** 8
- **Tags:** Text-to-Image, Transformers, English

The vision encoder for VideoLLaMA3, using Any-resolution Vision Tokenization (AVT) to dynamically process images and videos at different resolutions.

## Videollama2.1 7B 16F Base
- **Maintainer:** DAMO-NLP-SG · **License:** Apache-2.0 · **Downloads:** 179 · **Likes:** 1
- **Tags:** Video-to-Text, Transformers, English

VideoLLaMA2.1 is an upgraded version of VideoLLaMA2, focusing on enhancing spatiotemporal modeling and audio understanding capabilities in large video-language models.

## Videollama2.1 7B 16F
- **Maintainer:** DAMO-NLP-SG · **License:** Apache-2.0 · **Downloads:** 2,813 · **Likes:** 10
- **Tags:** Text-to-Video, Transformers, English

VideoLLaMA 2 is a multimodal large language model focused on video understanding, equipped with spatiotemporal modeling and audio comprehension capabilities.

## Videollama2 72B
- **Maintainer:** DAMO-NLP-SG · **License:** Apache-2.0 · **Downloads:** 26 · **Likes:** 10
- **Tags:** Text-to-Video, Transformers, English

VideoLLaMA 2 is a multimodal large language model focused on video understanding and spatiotemporal modeling; it supports video and image inputs and can perform visual question answering and dialogue tasks.

## Tarsier 34b
- **Maintainer:** omni-research · **License:** Apache-2.0 · **Downloads:** 103 · **Likes:** 17
- **Tags:** Video-to-Text, Transformers

Tarsier-34b is an open-source large-scale video-language model focused on generating high-quality video captions, achieving leading results on multiple public benchmarks.

## Videollama2 8x7B Base
- **Maintainer:** DAMO-NLP-SG · **License:** Apache-2.0 · **Downloads:** 20 · **Likes:** 2
- **Tags:** Text-to-Video, Transformers, English

VideoLLaMA 2 is a next-generation video large language model, focusing on enhancing spatiotemporal modeling and audio understanding capabilities, supporting multimodal video question answering and description tasks.

## Videollama2 8x7B
- **Maintainer:** DAMO-NLP-SG · **License:** Apache-2.0 · **Downloads:** 21 · **Likes:** 3
- **Tags:** Text-to-Video, Transformers, English

VideoLLaMA 2 is a multimodal large language model focused on video understanding and audio processing, capable of handling video and image inputs to generate natural language responses.

## Llava NeXT Video 34B Hf
- **Maintainer:** llava-hf · **Downloads:** 2,232 · **Likes:** 7
- **Tags:** Text-to-Video, Transformers, English

LLaVA-NeXT-Video is an open-source multimodal chatbot trained on mixed video and image data, excelling at video understanding.

## Llava NeXT Video 7B DPO Hf
- **Maintainer:** llava-hf · **Downloads:** 12.61k · **Likes:** 9
- **Tags:** Video-to-Text, Transformers, English

LLaVA-NeXT-Video is an open-source multimodal chatbot optimized through mixed training on video and image data, with strong video understanding capabilities. This variant is further tuned with DPO (direct preference optimization).

## Llava NeXT Video 7B Hf
- **Maintainer:** llava-hf · **Downloads:** 65.95k · **Likes:** 88
- **Tags:** Text-to-Video, Transformers, English

LLaVA-NeXT-Video is an open-source multimodal chatbot that achieves strong video understanding through mixed training on video and image data, reaching state-of-the-art results among open-source models on the VideoMME benchmark.

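The llava-hf checkpoints load with stock transformers (4.42 added the LlavaNextVideo classes). A minimal video-QA sketch, sampling frames with PyAV in the style of the model card; the video path is a placeholder:

```python
# Sketch: video question answering with llava-hf/LLaVA-NeXT-Video-7B-hf via
# transformers (>=4.42). The video path is a placeholder.
import av
import numpy as np
import torch
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

repo = "llava-hf/LLaVA-NeXT-Video-7B-hf"
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    repo, torch_dtype=torch.float16, device_map="auto"
)
processor = LlavaNextVideoProcessor.from_pretrained(repo)

def sample_frames(path: str, num_frames: int = 8) -> np.ndarray:
    """Decode `num_frames` evenly spaced RGB frames from a video file."""
    container = av.open(path)
    stream = container.streams.video[0]
    keep = set(np.linspace(0, stream.frames - 1, num_frames).astype(int))
    frames = [f.to_ndarray(format="rgb24")
              for i, f in enumerate(container.decode(stream)) if i in keep]
    return np.stack(frames)  # (num_frames, H, W, 3)

conversation = [{"role": "user", "content": [
    {"type": "video"},
    {"type": "text", "text": "What is happening in this video?"},
]}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
clip = sample_frames("/path/to/clip.mp4")  # placeholder path
inputs = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(out[0], skip_special_tokens=True))
```
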
## Sharegpt4video 8b
- **Maintainer:** Lin-Chen · **License:** Apache-2.0 · **Downloads:** 1,973 · **Likes:** 44
- **Tags:** Text-to-Video, Transformers

ShareGPT4Video-8B is an open-source video chatbot fine-tuned on open-source video instruction data.

## Xclip Base Patch16 Kinetics 600 16 Frames
- **Maintainer:** microsoft · **License:** MIT · **Downloads:** 393 · **Likes:** 2
- **Tags:** Text-to-Video, Transformers, English

X-CLIP is an extension of CLIP for general video-language understanding, supporting zero-shot, few-shot, or fully supervised video classification, as well as video-text retrieval tasks.

## Xclip Base Patch16 Kinetics 600
- **Maintainer:** microsoft · **License:** MIT · **Downloads:** 294 · **Likes:** 1
- **Tags:** Text-to-Video, Transformers, English

X-CLIP is an extended version of CLIP for general video-language understanding, trained via contrastive learning on (video, text) pairs.

## Xclip Large Patch14
- **Maintainer:** microsoft · **License:** MIT · **Downloads:** 1,698 · **Likes:** 11
- **Tags:** Text-to-Video, Transformers, English

X-CLIP is an extension of CLIP for general video-language understanding, trained via contrastive learning on (video, text) pairs.

## Xclip Base Patch16 16 Frames
- **Maintainer:** microsoft · **License:** MIT · **Downloads:** 1,034 · **Likes:** 0
- **Tags:** Text-to-Video, Transformers, English

X-CLIP is a minimal extension of CLIP for general video-language understanding, trained via contrastive learning on (video, text) pairs.

## Xclip Base Patch32 16 Frames
- **Maintainer:** microsoft · **License:** MIT · **Downloads:** 901 · **Likes:** 4
- **Tags:** Text-to-Video, Transformers, English

X-CLIP is an extended version of CLIP for general video-language understanding, trained on (video, text) pairs via contrastive learning, suitable for tasks like video classification and video-text retrieval.

## Xclip Base Patch32
- **Maintainer:** microsoft · **License:** MIT · **Downloads:** 309.80k · **Likes:** 84
- **Tags:** Text-to-Video, Transformers, English

X-CLIP is an extended version of CLIP for general video-language understanding, trained on (video, text) pairs via contrastive learning, suitable for tasks like video classification and video-text retrieval.

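All the X-CLIP entries above share the same transformers API for zero-shot video classification: encode a clip and a set of candidate labels, then softmax the video-text similarity logits. A minimal sketch against this most-downloaded checkpoint, with dummy frames standing in for a real decoded video:

```python
# Sketch: zero-shot video classification with X-CLIP via transformers. This
# checkpoint expects 8 frames per clip; random frames stand in for a real video.
import numpy as np
import torch
from transformers import XCLIPModel, XCLIPProcessor

repo = "microsoft/xclip-base-patch32"
processor = XCLIPProcessor.from_pretrained(repo)
model = XCLIPModel.from_pretrained(repo)

video = list(np.random.randint(0, 255, (8, 224, 224, 3), dtype=np.uint8))  # 8 dummy frames
labels = ["playing basketball", "cooking", "walking a dog"]

inputs = processor(text=labels, videos=video, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_video: similarity of the clip to each candidate text label
probs = outputs.logits_per_video.softmax(dim=1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```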